FreClu: Efficient Frequency-based De novo Short Read Clustering
-- preparation for input data

Please try the Java VM options -Xms and -Xmx to tune the heap size.
E.g. java -Xms10G -Xmx10G Radix_DNAseq

Note: The java files must be compiled (javac) before running the programs below.

    javac QV_format1.java
    javac QV_format1_qseq.java
    javac QV_format1_fastq.java
    javac QV_filter.java
    javac QV_format2.java
    javac ChangeSolexaQVCountTagToFASTA.java
    javac OverlapWithoutGap.java
    javac AlignmentMain.java
    javac FormatOverlap.java
    javac Merge_randomModel.java
    javac Radix_DNAseq.java
    javac Merge.java
    javac Radix_num.java

Usage: The procedure consists of the five steps below.

(1) REQUIRED :
    (1)-1. Join all the Illumina raw sequence files <*_seq.txt> with their
           QV files <*_prb.txt>, and remove sequences that contain the
           ambiguous base N. The output file is named <*_seq-prb.txt>.

           java QV_format1 <*_seq.txt>
    OR
    (1)-2. For raw sequence files in Illumina <*qseq.txt> format, which
           already merge sequences and QVs.

           java QV_format1_qseq <*_qseq.txt>
    OR
    (1)-3. For raw sequence files in Illumina <*fastq> format, which
           already merge sequences and QVs.

           java QV_format1_fastq <*fastq>

(2) OPTIONAL :
    For the output file <*_seq-prb.txt> of (1), apply a QV filter to remove
    apparently low-quality reads. In our analysis, at most 4 of the first
    20 bases of a read were allowed to have QV < 9.

           java QV_filter 9 4 20

(3) REQUIRED :
    For the output file of (2), cut the 3' linker sequence.

    (3)-1. For the 5'-end SAGE sample, we trimmed all reads at 25 bases.
           (If no 3' linker sequence needs to be cut, please use
           "java QV_format2".)

           java QV_format2 25
    OR
    (3)-2. For the small RNA sample, we trimmed the 3' linker where the
           3' end of a read overlapped the linker "TCGTATGCCGTCTTCTGCTTGT"
           by at least 5 bases, with over 80 percent similarity if the
           alignment length is less than 11 bases. The output file named
           "*_format_matchToLinker" should be used for further analysis.
           The other output files, "*_format_linkerAtStart" and
           "*_format_unmatchToLinker", contain the short reads that failed
           3' linker mapping.

           (3)-2-1. java ChangeSolexaQVCountTagToFASTA
           (3)-2-2. java AlignmentMain
           (3)-2-3. java FormatOverlap 5 11 0.8

(4) REQUIRED :
    (4)-1. Sort the redundant reads in the output file of (3) in
           lexicographical order.

           java Radix_DNAseq

    (4)-2. Scan the output file of (4)-1 to count the frequency of each
           non-redundant sequence and the sum of expected errors at each
           base (NOTE: Illumina QV).

           (4)-2-1. For data with an empirical BAC-based DNA substitution
                    file and an adjusted QV file:

                    java Merge -q -s
           OR
           (4)-2-2. For data without the files above, an evenly distributed
                    DNA substitution pattern (1/3) and non-adjusted QVs are
                    used.

                    java Merge_randomModel

(5) REQUIRED :
    Sort the non-redundant sequences in the output file of (4)-2 by
    frequency in descending order, using the radix-sort algorithm.

           java Radix_num
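
Illustration for step (2): the rule "at most 4 of the first 20 bases may have QV < 9" can be sketched as below. This is only a minimal sketch and not the code of QV_filter itself. The class name QvFilterSketch and the method passes are hypothetical, the QVs are assumed to be already parsed into an int array (no <*_seq-prb.txt> parsing is shown), and a read shorter than the window is assumed to be checked over the bases it has.

```java
import java.util.Arrays;

// Hypothetical illustration of the step (2) rule; not the QV_filter source.
class QvFilterSketch {

    // Returns true when at most maxLowBases of the first windowLength
    // quality values are below minQv.
    static boolean passes(int[] qvs, int minQv, int maxLowBases, int windowLength) {
        // Assumption: reads shorter than the window are checked over their full length.
        int limit = Math.min(windowLength, qvs.length);
        int lowCount = 0;
        for (int i = 0; i < limit; i++) {
            if (qvs[i] < minQv) {
                lowCount++;
            }
        }
        return lowCount <= maxLowBases;
    }

    public static void main(String[] args) {
        int[] good = new int[25];
        Arrays.fill(good, 30);            // every base has QV 30

        int[] bad = new int[25];
        Arrays.fill(bad, 30);
        for (int i = 0; i < 5; i++) {
            bad[i] = 3;                   // five low-quality bases in the window
        }

        // Same parameters as "java QV_filter 9 4 20".
        System.out.println(passes(good, 9, 4, 20)); // prints true
        System.out.println(passes(bad, 9, 4, 20));  // prints false
    }
}
```

Low-quality bases beyond position 20 do not count against a read in this sketch, which matches the "first 20 bases" wording above.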
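
Illustration for steps (4) and (5): the idea of sorting reads, collapsing identical reads into one non-redundant sequence with a count, and then ranking by count can be sketched as below. This is a simplified sketch under stated assumptions, not the code of Radix_DNAseq, Merge, or Radix_num. It uses the standard library sort instead of a radix sort, it ignores QVs and expected errors entirely, and the names FrequencySketch, Entry, and countAndRank are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of steps (4) and (5); library sort stands in for radix sort.
class FrequencySketch {

    // One non-redundant sequence together with its number of occurrences.
    static final class Entry {
        final String seq;
        final int count;

        Entry(String seq, int count) {
            this.seq = seq;
            this.count = count;
        }
    }

    static List<Entry> countAndRank(String[] reads) {
        // Step (4)-1 analogue: put the redundant reads in lexicographical order.
        String[] sorted = reads.clone();
        Arrays.sort(sorted);

        // Step (4)-2 analogue (counts only): identical reads are now adjacent,
        // so one linear scan yields each distinct sequence and its frequency.
        List<Entry> entries = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            int j = i;
            while (j < sorted.length && sorted[j].equals(sorted[i])) {
                j++;
            }
            entries.add(new Entry(sorted[i], j - i));
            i = j;
        }

        // Step (5) analogue: descending frequency. List.sort is stable, so
        // sequences with equal counts stay in lexicographical order.
        entries.sort((a, b) -> Integer.compare(b.count, a.count));
        return entries;
    }

    public static void main(String[] args) {
        String[] reads = {"ACGT", "TTGA", "ACGT", "GGCC", "TTGA", "ACGT"};
        for (Entry e : countAndRank(reads)) {
            System.out.println(e.seq + "\t" + e.count);
        }
    }
}
```

Sorting first is what makes the counting pass a single linear scan, which is presumably why step (4)-1 precedes step (4)-2.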